Introduction

Consider the case where you upload a file to a server and later want to retrieve it. How can you be sure that the file is the same as the file you originally uploaded? Perhaps someone hacked the server in the meantime and tampered with the file - how can you detect this?

Well, a (gormless) solution would be to simply store a copy of on your local machine and then check if the file returned from the server matches your local copy. For one, this verification might take a while to finish depending on the size of the file, and, secondly, having to maintain a local copy defeats the entire purpose of using the server for storage.

Another thing you could do is to hash the file with a collision resistant hash function and store only its digest . Later on, when retrieving the file from the server, you can simply check if the hash of the server's file matches the hash which you stored on your system. This is indeed an excellent solution for single files, but what about the case when multiple files are involved?

Merkle Trees

Merkle trees provide a way to solve this very problem. More generally, whenever one has different components that comprise some object , a Merkle tree can be used to verify both the integrity of the entire object as well as that of its individual components.

Suppose you have different files where is a power of 2 for simplicity (otherwise you can just use additional dummy files until becomes a power of 2). The first step is to hash each of the files to obtain their corresponding hashes . Next, divide the hashes into pairs according to their adjacency - . Concatenate the elements of each pair and hash the results. This process is repeated until there is only a single hash left which is what you store on your machine.

Later, when you are retrieving a specific file from the remote host, the server will send you the file together with the hashes necessary to calculate . For example, if you are requesting , then the server will return a file together with the hashes , , and . You are going to hash to obtain and then use this with to compute . This can now be used to calculate and subsequently . If this new root hash , which is based on the server's information, matches the root hash , which you computed when uploading the files, then you know that the file has not been tampered with!

In fact, you know that *no* file has been tampered with on the server's end because all the files are taken into account when the server sends you the hashes. If one of these hashes is not correct, then neither will be the root hash.